Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC
Authors
Abstract
Vocal Tract Length Normalization (VTLN) for standard filterbank-based Mel Frequency Cepstral Coefficient (MFCC) features is usually implemented by warping the center frequencies of the Mel filterbank, and the warping factor is estimated using the maximum likelihood score (MLS) criterion (Lee and Rose, 1998). A linear transform (LT) equivalent for frequency warping (FW) would enable more efficient MLS estimation (Umesh et al., 2005). We recently proposed a novel LT to perform FW for VTLN and model adaptation with standard MFCC features (Panchapagesan, 2006). In this paper, we present the mathematical derivation of the LT and give a compact formula to calculate it for any FW function. We also show that our LT is very closely related to previously proposed LTs for FW (McDonough, 2000; Pitz et al., 2001; Umesh et al., 2005), and that these LTs are numerically almost identical for the sine-log all-pass transform (SLAPT) warping functions. Our formula for the transformation matrix is, however, computationally simpler and, unlike previous linear transform approaches to VTLN with MFCC features (Pitz and Ney, 2003; Umesh et al., 2005), requires no modification of the standard MFCC feature extraction scheme. In VTLN and Speaker Adaptive Modeling (SAM; Welling et al., 2002) experiments with the DARPA Resource Management (RM1) database, the performance of the new LT was comparable to that of regular VTLN implemented by warping the Mel filterbank, when the MLS criterion was used for FW estimation. This demonstrates that the approximations involved do not lead to any performance degradation. Performance comparable to front end VTLN was also obtained with LT adaptation of HMM means in the back end, combined with mean bias and variance adaptation according to the Maximum Likelihood Linear Regression (MLLR) framework.
The FW methods performed significantly better than standard MLLR for very limited adaptation data (1 utterance), and were equally effective with unsupervised parameter estimation. We also performed Speaker Adaptive Training (SAT) with a feature space LT denoted CLTFW. Global CLTFW SAT gave results comparable to SAM and VTLN. By estimating multiple CLTFW transforms using a regression tree, and including an additive bias, we obtained significantly improved results compared to VTLN as the amount of adaptation data increased.
Preprint submitted to Elsevier, 26 June 2008
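The idea of expressing frequency warping as a linear transform of the cepstra can be illustrated with a minimal NumPy sketch. It is not the paper's formula: it assumes an orthonormal DCT-II, a warp applied directly to the normalized channel position u in [0, 1], and an illustrative one-parameter warp u + alpha*sin(pi*u). The function names (dct_matrix, warped_idct_matrix, fw_transform, sine_warp) are hypothetical and chosen only for this example.

```python
import numpy as np


def _dct_scale(num_ceps, num_chans):
    # Orthonormal DCT-II scaling: sqrt(1/M) for k = 0, sqrt(2/M) otherwise.
    scale = np.full(num_ceps, np.sqrt(2.0 / num_chans))
    scale[0] = np.sqrt(1.0 / num_chans)
    return scale


def dct_matrix(num_ceps, num_chans):
    """DCT matrix C (num_ceps x num_chans), so that cepstra = C @ log_fbank."""
    k = np.arange(num_ceps)[:, None]
    u = (np.arange(num_chans) + 0.5) / num_chans  # channel positions in [0, 1]
    return _dct_scale(num_ceps, num_chans)[:, None] * np.cos(np.pi * k * u[None, :])


def warped_idct_matrix(num_ceps, num_chans, warp):
    """Cosine interpolation of the log spectrum, sampled at warp(u_m).

    Shape is (num_chans x num_ceps). With the identity warp this is the
    transpose of dct_matrix, i.e. the ordinary (truncated) IDCT.
    """
    k = np.arange(num_ceps)[None, :]
    u = (np.arange(num_chans) + 0.5) / num_chans
    return _dct_scale(num_ceps, num_chans)[None, :] * np.cos(np.pi * warp(u)[:, None] * k)


def fw_transform(num_ceps, num_chans, warp):
    """Square matrix A such that warped_cepstra = A @ cepstra (sketch only)."""
    return dct_matrix(num_ceps, num_chans) @ warped_idct_matrix(num_ceps, num_chans, warp)


def sine_warp(alpha):
    """Illustrative smooth warp that fixes the endpoints 0 and 1 (an assumption)."""
    return lambda u: u + alpha * np.sin(np.pi * u)


# Usage: warp a batch of 13-dimensional cepstral frames (random placeholder data).
A = fw_transform(num_ceps=13, num_chans=26, warp=sine_warp(0.05))
frames = np.random.default_rng(0).normal(size=(100, 13))
warped_frames = frames @ A.T
```

Under these assumptions the identity warp gives the identity matrix, so unwarped features are left unchanged. Because A depends only on the warp parameter, one matrix per candidate value could be precomputed and applied to already-extracted MFCCs during a search over warping factors, without recomputing the filterbank.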
Similar resources
Frequency warping by linear transformation of standard MFCC
A novel linear transform (LT) is proposed for frequency warping (FW) with standard filterbank based MFCC features. Here, we use the idea of spectral interpolation of [9] to perform a continuous warping in the log filterbank output domain, and incorporate both interpolation and warping into a single warped IDCT matrix. The new transformation matrix is thus mathematically simpler than in [9], and...
MLLR-like speaker adaptation based on linearization of VTLN with MFCC features
In this paper, an MLLR-like adaptation approach is proposed whereby the transformation of the means is performed deterministically based on linearization of VTLN. Biases and adaptation of the variances are estimated statistically by the EM algorithm. In the discrete frequency domain, we show that under certain approximations, frequency warping with Mel-filterbank-based MFCCs equals a linear tran...
Implementing frequency-warping and VTLN through linear transformation of conventional MFCC
In this paper, we show that frequency-warping (including VTLN) can be implemented through linear transformation of conventional MFCC. Unlike the Pitz-Ney [1] continuous domain approach, we directly determine the relation between frequency-warping and the linear-transformation in the discrete-domain. The advantage of such an approach is that it can be applied to any frequency-warping and is not ...
Using VTLN matrices for rapid and computationally-efficient speaker adaptation with robustness to first-pass transcription errors
In this paper, we propose to combine the rapid adaptation capability of conventional Vocal Tract Length Normalization (VTLN) with the computational efficiency of transform-based adaptation such as MLLR or CMLLR. VTLN requires the estimation of only one parameter and is, therefore, most suited for the cases where there is little adaptation data (i.e. rapid adaptation). In contrast, transform-bas...
Linear transformation approach to VTLN using dynamic frequency warping
In the paper, we present a novel linear transformation approach to frequency warping during vocal tract length normalisation (VTLN) using the idea of dynamic frequency warping (DFW). Linear transformation among the mel-frequency cepstral coefficients (MFCC) provides the computational advantage of not having to recompute features for each warp factor in VTLN. The proposed method uses the idea of separ...
Journal title:
- Computer Speech & Language
Volume 23, Issue
Pages -
Publication date 2009